April 20, 2018

Hi!

I’m Daniel

Thanks

Community

Y’all

Pandas For Everyone

I’m an Author! :O

Doing Data Science

R! and Python!

We’re all friends

What is it?

Tools Used

What is it?

Tasks Performed

Last year…

Structure your projects!

(Computational Biology) Project Structure

Script it!

#!/usr/bin/env bash
# Usage: pass the project directory as the first argument
cd "$1" || exit 1
mkdir doc data src bin results

cd doc
echo "Doc directory with one subdirectory per manuscript" > README
touch .gitkeep

cd ../data
echo "Data directory for storing fixed data sets" > README
touch .gitkeep

cd ../src
echo "src for source code" > README
touch .gitkeep

cd ../bin
echo "bin for compiled binaries or scripts" > README
touch .gitkeep

cd ../results
echo "Results directory for tracking computational experiments performed on data" > README
touch .gitkeep

echo "Folders created."

cd ..
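
The script above repeats the same three steps for every folder. A minimal sketch of the same layout with those steps pulled into one function is shown below. The function name `make_project_dir` and the fallback to a temporary directory when no argument is given are assumptions made for illustration, not part of the original script.

```shell
#!/usr/bin/env bash
# Sketch: same project layout, with the repeated steps in one function.
set -eu

# Create one project subdirectory with a README and a .gitkeep placeholder.
# (make_project_dir is a hypothetical helper name.)
make_project_dir() {
    local dir="$1" description="$2"
    mkdir -p "$dir"
    echo "$description" > "$dir/README"
    touch "$dir/.gitkeep"
}

# Use the first argument as the project root; fall back to a temporary
# directory (an assumption, handy for a dry run) when none is given.
project_root="${1:-$(mktemp -d)}"
mkdir -p "$project_root"
cd "$project_root"

make_project_dir doc     "Doc directory with one subdirectory per manuscript"
make_project_dir data    "Data directory for storing fixed data sets"
make_project_dir src     "src for source code"
make_project_dir bin     "bin for compiled binaries or scripts"
make_project_dir results "Results directory for tracking computational experiments performed on data"

echo "Folders created in $project_root"
```

Saved under a name of your choosing, it could be run as `bash new_project.sh my_project`, where the script name is also hypothetical.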

Best Practices … (2014)

Good enough practices … (2017)

tl;dr

Best Practices
  1. Write programs for people, not computers
  2. Let the computer do the work
  3. Make incremental changes
  4. Don’t repeat yourself (or others)
  5. Plan for mistakes
  6. Optimize software only after it works correctly
  7. Document design and purpose, not mechanics
  8. Collaborate
Good Enough
  1. Data management
  2. Software
  3. Collaboration
  4. Project Organization
  5. Keeping track of changes
  6. Manuscripts

Summer Program

+ ~20 people into the lab … to stress test!

Blog post: From VMs to LXC Containers to Docker Containers


  • fully document the setup process
  • tear down and spin up the container if something goes wrong
  • system libraries for R packages
  • easy to try out a new technology before full integration

Infrastructure

Project Template

Installing R Packages

  • Separate Docker Container for installing R packages (rpkgs)
  • This installs R packages into a persistent docker volume
  • Everyone has /rpkgs mounted in their rstudio container
  • Add /rpkgs to everyone’s .libPaths()
site_path = R.home(component = "home")
fname = file.path(site_path, "etc", "Rprofile.site")
write(".libPaths(c('/rpkgs', .libPaths()))", file = fname, append = TRUE) # prepend to .libPaths
write('local({r <- getOption("repos"); r["CRAN"] <- "https://cloud.r-project.org/"; options(repos=r)})',
      file = fname, append = TRUE)

Installing R Packages (Development Server)

Installing R Packages (Production Server)

  1. Pull the images from Docker Hub with docker pull:
docker pull sdal/mro-c7sd_auth
docker pull sdal/rss-mro-c7sd_auth
docker pull sdal/rpkgs-mro-c7sd_auth
docker pull sdal/shy-mro-c7sd_auth
  2. Start up the RStudio containers: docker-compose -f rstudio-compose.yml up -d --no-recreate

RStudio Server

  • Open Source Edition great for individual use
    • Pretty much building the Pro stack…
    • nginx
      • web server: reverse proxy, load balancer, and HTTP cache
  • Exploring RStudio Pro… much better suited for parallel projects, groups, and teams (?)

RStudio Server for Everyone

docker-compose.yml/rstudio-compose.yml:

volumes:
  rpkgs:
  checkpoint:

services:
  rstudio_chend:
    image: sdal/rss-mro-c7sd_auth
    container_name: rstudio_chend
    volumes:
      - /sys/fs/cgroup:/sys/fs/cgroup:ro
      - /etc/group:/etc/group
      - /home:/home
      - rpkgs:/rpkgs
      - checkpoint:/checkpoint
    cap_add:
      - SYS_ADMIN
    ports:
      - 3125:8787

Authorship

  • Collaborative \(\LaTeX\) with real-time rendering… with Git!

Virginia Tech Libraries provides free Overleaf Pro+ accounts

Sharing and Documenting Projects



The Skills

Git is hard & Shell (Bash) is the glue

Balancing best/good practices…

… with getting work done


Titus Brown
Associate Professor at UC Davis




Write Functions! in your projects

  • Software-Carpentry Lessons
  • DataCamp has an entire course
  • No need to package up functions that you won’t be using again
  • But when you start to use them in multiple projects…
    • Expand your function to be more general and robust
    • Move it into a package.

No setwd()!

Saving things

  • Save out long calculations into intermediate datasets
  • Use base::saveRDS() and base::readRDS()
    • vs base::save() and base::load()
v <- 1:10 # I want to save this...

save(v, file = 'awesome_datascience.RData')
rm(v)
load(file = 'awesome_datascience.RData')
v
##  [1]  1  2  3  4  5  6  7  8  9 10
saveRDS(v, file = 'super_awesome_datascience.RDS')
loaded <- readRDS(file = 'super_awesome_datascience.RDS')
loaded
##  [1]  1  2  3  4  5  6  7  8  9 10

Secrets

  • Hardcoding secrets (e.g., passwords, API keys) in your code
  • Reverting lines before a commit
  • Sourcing a special ignored file
# uses console or rstudio to do password prompt
getPass::getPass("database username")
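
The "special ignored file" option can be sketched in shell as follows. The file name `.secrets.env`, the variable names `DB_USER` and `DB_PASS`, and the placeholder values are all hypothetical. The demo runs in a throwaway directory so it does not touch a real project.

```shell
#!/usr/bin/env bash
# Sketch: keep secrets in a file that git ignores, then source it at run time.
# File name and variable names are illustrative assumptions.
set -eu

workdir="$(mktemp -d)"   # throwaway project directory for the demo
cd "$workdir"

# 1. Put the secrets in their own file, readable only by you.
cat > .secrets.env <<'EOF'
DB_USER=example_user
DB_PASS=not-a-real-password
EOF
chmod 600 .secrets.env

# 2. Tell git to ignore that file so it is not committed by accident.
echo ".secrets.env" >> .gitignore

# 3. Scripts source the file instead of hardcoding the values.
. ./.secrets.env
echo "Connecting as ${DB_USER}"
```

The code that gets committed then only ever refers to variable names, never to the values themselves.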

Secret Library

.secret_to_keep <- function(user, pass) {
    if (is.null(pass)) {
        pass <- getPass::getPass("LDAP Password (the one you use to login to Lightfoot and RStudio):")
    }
    secret_to_keep <- c(password = pass,
                        username = user)
    return(secret_to_keep)
}

setup_user_pass <- function(username = unname(Sys.info()['user']),
                            password = NULL,
                            public_key = '~/.ssh/id_rsa.pub',
                            vault = '/home/sdal/projects/sdal/vault',
                            secret_name = unname(Sys.info()['user']),
                            verbose = FALSE) {
    add_user(username, public_key, vault)
    secret_to_keep <- .secret_to_keep(user = username, pass = password)
    add_secret(secret_name, secret_to_keep, users = username, vault = vault)
}

get_my_password <- function(secret_name = unname(Sys.info()['user']),
                            key = local_key(),
                            vault = '/home/sdal/projects/sdal/vault') {
    return(unname(get_secret(secret_name, key, vault)['password']))
}

Package!

Testing

Cookbooks

RMarkdown

  • This presentation is written in it!
  • I’ve given Meetup Talks about it: NYC, DC
  • Websites, Books, Presentations, Dashboards, Reports…

Graphics Cookbook

A Better Default Colormap for Matplotlib

SciPy 2015 | Nathaniel Smith and Stéfan van der Walt

Viridis

Perceptual Color Maps in matplotlib for Oceanography

SciPy 2015 | Kristen Thyng

Colorbrewer

Other things

  • Cookbooks for your own use cases (e.g., GIS)
  • Make your own “flight rules”

Flight Rules are the hard-earned body of knowledge recorded in manuals that list, step-by-step, what to do if X occurs, and why. Essentially, they are extremely detailed, scenario-specific standard operating procedures. […]

NASA has been capturing our missteps, disasters and solutions since the early 1960s, when Mercury-era ground teams first started gathering “lessons learned” into a compendium that now lists thousands of problematic situations […] and their solutions.

— Chris Hadfield, An Astronaut’s Guide to Life.

Why?

Our planet needs our help, and we need (good) science to fix it.

— Greg Wilson

Thanks, again!

:) #rstatsnyc #nycdatamafia